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Abstract. - Tree reconstruction methods are often judged by their accuracy, 
measured by how close they get to the true tree. Yet most reconstruction methods like 
ML do not explicitly maximize this accuracy. To address this problem, we propose a 
Bayesian solution. Given tree samples, we propose finding the tree estimate which is 
closest on average to the samples. This "median" tree is known as the Bayes estimator 
(BE). The BE literally maximizes posterior expected accuracy, measured in terms of 
closeness (distance) to the true tree. We discuss a unified framework of BE trees, 
focusing especially on tree distances which are expressible as squared euclidean 
distances. Notable examples include Robinson-Foulds distance, quartet distance, and 
squared path difference. Using simulated data, we show Bayes estimators can be 
efficiently computed in practice by hill climbing. We also show that Bayes estimators 
achieve higher accuracy, compared to maximum likelihood and neighbor joining. 

key words: Bayes estimator, consensus tree, path difference metric, phylogenetic 
inference. 



Introduction 



When a large phylogeny is reconstructed from sequence data, it is typically expected 
that the reconstructed tree is at least slightly wrong, i.e. slightly different than the true 
tree. We refer to the difficulty in accurately reconstructing phylogenies as tree 
uncertainty. 

Tree uncertainty is a pervasive issue in phylogenetics. To help cope with tree 
uncertainty, bootstrapping and Bayesian sampling methods provide a collection of 
possible trees instead of a single tree estimate. Using bootstrapping or Bayesian 
sampling, one common practice is to identify highly supported tree features (e.g. splits) 
which occur in almost all the tree samples. Highly supported features are regarded as 
likely features of the true tree. 

Similarly, in simulation studies it is co mmon to judge reconstruct ion methods 



based on how close they get to the true tree (iDesper and Gascuell (120041 ) ). Closeness to 
the true tree can be measured in many different ways. One popular measure of closeness 
is the Robinson-Foulds (RF) distance (also known as symmetric difference). 

These customary practices reflect a common view that when tree uncertainty is 
likely, a good reconstruction method ought to at least find a tree which is close to the 
true tree. For example, if multiple trees have high likelihood, then a good tree estimate 
should be an "accurate representative" of the high likelihood trees. Yet, reconstruction 
methods like maximum likelihood (ML) are not directly designed to achieve this goal. 
This leads us to ask whether reconstruction accuracy (i.e. closeness to the true tree) can 
be improved, by attempting to directly optimize accuracy instead of likelihood. 

Even though the true tree is unknown, we can still optimize reconstruction 
accuracy using a Bayesian approach. In the Bayesian view, the true tree is a random 
variable T distributed according to the posterior distribution P(T | D), where D is input 
data such as sequence data. If d() measures distance between trees, and T' is a tree 



estimate, then the expected distance between T' and the true tree is Ey~p(T| D)d(T, T'). 
Thus, to maximize reconstruction accuracy, we should choose our tree estimate to be 
T* = a.rgmm T ,E,T~p(T\D)d{T,T') where T* is known as a Bayes estimator. 

Many popular distances between trees can be easily expressed as a squared 
euclidean distance, after embedding trees in an appropriately chosen vector space. 
Important examples include Robinson-Foulds distance (symmetric difference), quartet 
distance, and the squared path difference. In this paper, we focus on squared euclidean 
distances. 

In statistical decision theory, Bayes estimators under squared euclidean distance 
are well understood and have nice properties. For example, under a squared euclidean 
distance, the Bayes estimator m i nimiz es the distance to the mean of the posterior. This 



gives the result in ([Holder et al. 



20081 ): The majority- rule consensus tree is the Bayes 
estimator, if closeness between trees is defined by Robinson-Foulds distance. We also 
derive a closely related result for quartet distance: Under quartet distance, the Bayes 
estimator tree is equivalent to a weighted quartet puzzling problem. 

In general, computing Bayes estimators is at least as hard as computing ML 
trees. Hill climbing techniques are popular and effective heuristics for hard tree 
optimization problems such as ML. Thus we propose hill climbing to compute Bayes 
estimator trees as well. For squared euclidean distances, each hill climbing step is quite 
fast, comparable to a traditional ML hill climbing step. 

We provide a simulation study of Bayes estimators using the path difference 
metric. We use hill climbing with nearest neighbor interchange (NNI) moves to find 
Bayes estimators. We observe that hill climbing is fast in practice, after the 
preprocessing step of sampling the posterior on trees. More importantly, we observe that 
Bayes estimator trees are more accurate on average, compared to ML and neighbor 
joining (NJ). These results comprise an encouraging pilot study of Bayes estimators. We 
conclude by discussing improvements and directions for future work developing Bayes 



estimators for phylogeny. 

BAYES ESTIMATORS AND SQUARED EUCLIDEAN DISTANCE 

Let D denote a collection of homologous sequences from n species. Many 
evolutionary models exist which express P(D \ T, 9) in terms of an underlying 
phylogenetic tree T on the n species, and evolutionary rate parameters 9. Given such a 
model, and observed sequence data D, there are two main methods for sampling trees T 
which could have generated D: 

• The Bayesian method, which declares a prior P(T) on tree topologies, and uses 
sampling techniques such as Monte Carlo Markov Chain (MCMC) to 
approximately sample from P(T | D) oc P{T)P{D \T), 

• The bootstrap method, which creates hypothetical datasets Di by bootstrapping 
columns from an alignment of D, and then computes a tree Tj = T(Di) for each Di 
by applying a tree reconstruction method such as ML or NJ. 

The notation P(T\D) is not entirely appropriate for the distribution on trees 
obtained by the bootstrap method. Nevertheless, for convenience we will use the 
notation P(T\D) for the obtained distribution, regardless of whether the Bayesian or 
bootstrap method is used. 

Given a measure of dissimilarity (or distance) d(T, T') between phylogenetic trees 
on n taxa, the (posterior) expected loss associated with a tree T" is Kd(T, X"), where the 
expectation is taken over T, distributed as P(T\D). We write p(T') for the expected 
loss. The Bayes estimator T* minimizes the expected loss: 

T* = argmin T , p{T') 

In other words, regarding the true tree T as a random variable distributed as P(T\D), 
the Bayes estimator is the tree T* which is closest to T on average. 



Bayes estimat ors are a common tool in statistical optimization and decision 



theory (IBerger 



19851 ). Given a finite sample T±, . . . , Tjy from P(T\D), the empirical 
expected loss is p(T') = X^=i d(T', Tj), and the empirical Bayes estimator is the tree 
that minimizes the empirical expected loss. In this paper we will focus on empirical 
Bayes estimators for a given sample, and so we will simply say "Bayes estimator" when 
we mean the empirical Bayes estimator. 

Squared euclidean distances between trees 

Let T n be the space of trees on n taxa. We call d(-, •) a squared euclidean distance 
if there is a function v : %, — > IR m for some m, such that 



d(T,T') = \ \v(T)-v(T 



'M|2 



We call v() a (vector space) embedding. Recall that for two vectors a = (a±, . . . , a m ), 
b = (b u . . . , b m ) in R m , we have | \a\ | 2 = Y7=i a l and 1 1° ~ b \ ? = \ \ a \ ? + I ? ~ 2 ( a ' b ) 
where a ■ b denotes the dot product YlT=i a «^«- 

Many popular distances between trees are squar ed euclidean distances. Below we 



1993|). For 



list several such distances, all of which were studied in (jSteel and Pennyl . 
each distance, we illustrate the vector space embedding and the distance using the two 
trees T\ and T2 shown in Figure [3] (no branch lengths) and Figure H] (branch lengths). 



Example 1 Let S(T) den ote the set of splits induced 



Robins on-Foulds distance /(Robinson and Foulda . 



by a tree T. The (normalized) 



198 n) dRF{T,T f ) is half the size of the 



symmetric difference (S(T) — S(T')) U (S(T') — S(T)). The Robins on-Foulds distance 
can also be realized as the squared euclidean distance 

d RF {T\T)= l -\\v RF {T)-v RF {T')\\ 2 



where v RF '■ %i 



f>2 n - L -l 



maps tree T to the 0/1 vector v RF (T) whose nonzero entries 



correspond to splits in T. For example, for the trees T\ and T 2 in Figure^ we have 



and 



vMTi) = (1,1, i, 1,1,1, o, 0, 0, 0, 0, 1, 0, 0, 0), 



vrf{T 2 ) = (1,1, 1, 1,1,1, 0, 0, 1, 0, 0, 0, 0, 0, 0), 



d RF (T u T 2 ) = ^WvrfW-Vrf^W 2 = 1. 



Here the coordinates of vrf(Ti) and vrf{T 2 ) are given by 



({A}, {£?}, {C}, {D}, {E}, {A, B}, {B, C}, {A C}, {C, D}, 

{B, D}, {A, D}, {D, E}, {C, E}, {B, E}, {A, E} 



where for example {B, D} corresponds to the partition { {B, D}, {A, C, E} }. 



Exampl e 2 Let Q(T) denote t he set of quartets induced by a tree T. The quartet 



distance /(Estabrook et al 



198a) (Iq(T,T') is half the size of the symmetric difference 
(Q(T) — Q{T')) U (Q(T') — Q(T)). Analogous to the Robins on-Foulds distance, dQ can 
be realized as a squared euclidean distance, 



where v Q : T n 



d Q (T',T) = ~\\v Q (T)-v Q (T')\\ 2 



maps tree T to the 0/1 vector Vq(T) whose nonzero entries 



correspond to quartets in T. For example, for the trees T x and T 2 in Figure^ we have 



v Q (T 1 ) = (1,0, 0,1, 0,0, 1,0, 0,1, 0,0, 1,0,0), 



v Q (T 2 ) = (1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0), 



and 



d Q (T 1 ,T 2 ) = -|K ; q( t i) -vq(T 2 



2. 



Here the coordinates of vq{Ti) and vq(T 2 ) are given by following cherry groupings (two 
leaves with the same parent node) 



{AB, CD}, {AC, BD}, {AD, BC}, {BC, DE}, {BD, CE}, {BE, CD}, {AB, CE}, {AC, BE}, 
{AE, BC}, {AC, DE}, {AD, CE}, {AE, CD}, {AB, DE}, {AD, BE}, {AE, BD} ) . 



Example 3 For T G T n , let Dt G M\ 2 ) be the matrix of pairwise distances between 

leaves in T. The squared dissimilarity map distance is defined as 

dr)(T',T) = \ \Dt — Dr> II 2 - The d issimilarity map distance is perhaps one of the oldest 



studied, see e.g. \Buneman . 
have 



191 n ). For example, for the trees T\ and T 2 in Figure^ we 



and 



D Tl = (5.3, 9.0, 15.2, 12.4, 6.1, 12.3, 9.5, 10.8, 8.0, 8.0), 



D T2 = (3.5, 11.3, 13.2, 10.9, 12.0, 13.9, 11.6, 7.1, 7.0, 8.9), 



d D {T 1 ,T 2 ) = \\D Tl -D T2 \\ 2 = 12M. 



Here the coordinates of Dt x and Dt 2 are given by 



(-^1,2) ^l,3,j D\ t 4, D 2 3, D 2 ^, . . . , .^4,5 J : 



where Dij is the sum of branch lengths of the path from leaf i to j. 

Example 4 The distances dnF(T,T') anddq{T,T') are topological distances, i.e. they 
only depend on the topologies ofT,T', and not edge lengths. The dissimilarity map 
distance does depend on edge lengths, but it has a natural topological analog called the 



path difference metric. The squared path difference is 



d p (T\T) = \\v p (T)-v p (T 



'M|2 



where v p {T) e M\ 2 ) is the integer vector whose ijth entry co unts the number o 



edge s 



1993) 



between leaves i and j in T. Path difference was studied in /[Steel and Pennv\, \1993i ). 
Note that in our notation, we have squared the norm, whereas ((Steel and Pennu , 
defined d p (T',T) = \\v p (T) — u p (T')||. 

For example, for the trees T\ and T 2 in Figure [3J we have 



and 



Up (T 1 ) = (2,3,4,4,3,4,4,3,3,2), 



^(T 2 ) = (2,4,4,3,4,4,3,2,3,3), 



d p (T 1 ,T 2 ) = \\v p (T 1 )-v p (T 2 )\\ 2 = 6. 



Here the coordinates of v p (Ti) and v p {T 2 ) are given by 



Ul,2, ^1,3,^1,4, ^2,3, ^2,4, • • • , ^4,5 I , 



where v^j is the number of edges between leaf i and j. 

The above examples highlight the fact that many combinatorial distances can be 
interpreted as squared euclidean distances. Under a squared euclidean distance, the 
Bayes estimator is the projection of the mean onto the nearest tree. More specifically, if 
d(T,T') = \\v(T) — v(T')\\ 2 is a squared euclidean distance, then evidently 

p(T') = \ \v(T') -jj\\ 2 + Var 



where /i = E[w(T)] and \i 2 = E[ ||t?(T)|| 2 ], and Var = \i 2 — ||/i|| 2 does not depend on T' . 



For example, under the Robinson-Foulds distance, the Bayes estimator is 
obtained by projecting the vector of split frequencies hrf = ^vrf(T) onto the nearest 
0/1 vector Vrf(T*) G {vrf(T')} T i C {0, l} 2 " -1 . If we relax this problem, and simply 
project hrf onto the nearest 0/1 vector v* G {0, l} 2 ™ -1 , then we see v* is obtained by 
rounding all entries in flap to the nearest integer or 1. I n other words y* = V rf(T*) 



Holder et al. 



fl2008h : the 



where T* is the consensus tree. Thus we have the result in 
consensus tree is the Bayes estimator for Robinson-Foulds distance. 

In our view, projecting a point (e.g. input dissimilarity map) to a nearby tree is a 
geometric analog of a Bayes estimator. Indeed, distance-based tree reconstruction 
methods can be loosely regarded as "projections" of an input dissimilarity map 
D G r( 2 ) onto a tree metric = D — e, where e is "small" according to some norm. 
The geometry of distance-based tree reconstruction methods has been studied be fore, 



see ( lEickmeyer et al. 



2008 



Eickmeyer and Yoshida 



2008 



Mihaescu et al. 



2003). 



Relation between Bayes estimators and existing reconstruction methods 



Quartet puzzling 

Under the quartet distance cLq(T,T') = \\vq(T) — vq(T')\\ 2 , the Bayes estimator 
is the tree T* which minimizes ||i>q(T) — /xq|| 2 , where \xq = Kvq(T) is the vector of 
posterior quartet frequencies. Since ||wq(T)|| 2 = (™) for all trees on n taxa, we have 



\\ v q{T) -1*q\ 



+ H/iqII 2 - 2v Q (T) ■ fi Q = (constant) - 2v Q (T) ■ fi Q 



and so the Bayes estimator T* can be equivalently defined as T* = argmax T/ u,Q • vq(T). 
Maximizing fiQ ■ vq(T) is a weighted quartet puzzling problem: Given a set of weights fiQ 
on quartets, find a compatible set of quartets of maximal weight. If all quartet weights 
are 0/1, then we obtain the t raditi onal quartet puzzling problem 



fjStrimmer and von Haeseler 



1996|) 



Analogous to split frequencies and the consensus tree, we can use a sample of 
trees to estimate quartet frequencies, and then apply weighted quartet puzzling to find 



the Bayes estimator tree. In genera . 



quartet puzzling) is NP-hard 



thou gh, quartet puzzling (and hence weighted 



Steel! (1l992f ) . However, the r e has been considera b 



progress toward solving large instances: see 



Erdos et al 



fll997h : 



Snir and Raol (120091 ) for 



example. In our case, the weights /xq have special structure since they are realizable by 
a collection of trees; this might make the weighted quartet puzzling we are considering 
here easier. 

Ordinary Least Squares (OLS) minimum evolution (ME) 

For a squared distance dissimilarity map, there is a striking similarity between 
Bayes estimators and the minimum evolution (ME) approach to phylogenetic 
reconstr uction. ME methods are distance-based methods that have been extensively 



studied ( jHolder and Lewis 



2003 



Rzhetskv and Neil. 



is Ordinary Least Square 



1993 g) . One o f the e arliest examples 



Desper and Gascuel 



(OLS) ME (Edwards and Cavalli-Sforza 



1963 



20021 ). OLS ME first estimates the branch lengths for each tree 
topology T by minimizing \ \Dt — D\\ 2 , where D is the input dissimilarity map. Then 
the outputted tree topology T* is the topology whose sum of estimated branch lengths is 
minimal. If D = Dt + e, where Dt is a tree metric and e comprises i.i.d. errors with 
mean 0, then OLS ME is statistically consistent as a method to recover Dt- 

There is however a key difference between OLS ME and minimizing the expected 
squared dissimilarity map distance. The input to OLS ME is a dissimilarity map 
presumed to be of the form D = Dt + e. In sharp contrast, the mean /x summarizes the 
posterior distribution on Dt, given input such as sequence data. Although /x could be 
viewed as a random variable whose distribution is governed by the true underlying tree 
T, the form of this distribution P(/x |T) is opaque and depends on the model of sequence 
evolution being used. Thus, while directly minimizing \ \Dt — /i|| 2 produces the Bayes 



estimator T*, it is not clear whether the minimum evolution approach (treating /jasa 
"perturbed tree metric") is a sensible alternative. 



Hill climbing optimization 



Since the number of tree topolog 1GS Oil Tl tcLXcL grows exponentially in n, 
computing the Bayes estimator T* under a general distance function can be 
computat ionally hard. However, hill c limbing; techniques such as those used in ML 



methods (IGuindon and Gascuel 



20031 ) often work quite well in practice for tree 
reconstruction. Hill climbing techniques can similarly be used to find local minima of 
the empirical expected loss. 

Hill climbing requires a way to move from one tree topology to another. Three 
types of combinatorial tree moves are often used for this purpose; Nearest Neighbor 
Interch ange (NNI), Subtree-Pru ne- and-Regraft (SPR), and Tree- Bisection- Reconnect 



(TBR) ( ISemple and Steell . 120031 ) . SPR and TBR moves are more general than NNI, but 
every SPR and TBR move is a composition of at most two NNI moves. SPR and TBR 
moves endow e ach tree with Q( n 2 ) nei ghbors. NNI moves produce a smaller set of 0(n) 



neighbors. See (lAllen and Steell (120011 )) f or details. PHYML uses 



NNI moves when hill 



climbing to quickly search for a ML tree (IGuindon and Gascuell . 



20031 ). We follow their 



example and choose NNI moves to apply hill climbing. 

For each proposed move T current — > T new during hill climbing, p(T new ) must be 
computed. A straightforward evaluation using the definition p(T new ) = 
~k YliLi d(T new , Tj) requires A" evaluations of d(), where A" is the sample size. For 
squared euclidean distances the situation is often much better since p(T new ) can be 
re-expressed (up to an additive constant), as simply p(T new ) = d(jl,T ne ' w ), where 
jl = ^i=i v (Ti) is the sample mean. Note fi does not depend on the tree T™, thus it 
can be computed once at the beginning of hill climbing. Consequently, at each step we 
need only evaluate d() once. The computational expense to calculate d() depends on the 



choice of vector space embedding. 



Simulation study: Methods 

For studying Bayes estimators, a natural first choice for distance between trees is 
Robinson-Foulds distance. But then the Bayes estimator is the consensus tree, which 
has been extensively studied, and is easy to compute from samples. We thus sought out 
other important distances besides Robinson-Foulds. 

The dissimilarity map distance is one of the oldest distances for the comparison 
of trees, and lies at the foundation of distance-based reconstruction methods. Thus, 
dissimilarity map and related distances are a natural choice for case-study of Bayes 
estimators. We specifically chose the (squared) path difference metric. The path 
difference metric ||i> p (T) — i> p (T")|| is precisely the dissimilarity map distance 
\\D T — D T i\\, if all edge lengths in T,T' are redefined to be 1. Setting all edge lengths to 
1 prevents deemphasis of the shorter (presumably uncertain) edges. Intuitively, this 
emphasizes topological accuracy in the Bayes estimator. We believe this is a desirable 
property, and we are n ot the first to suggest its importance. The conclusion of 



Steel and Pennyl (119931 ) states 



"The path difference metric, d p , has several interesting features that suggest 
that it merits more study and consideration for use when studying 
evolutionary trees. These features will make it particularly attractive when 
studying large trees. ■ • • The d p metric may be the method of choice when 
trees are more dissimilar than expected by chance." 

Thus we chose the squared path difference as a case study. We think quartet 
distance would also be interesting, but believe that a study of Bayes estimators under 
quartet distance should include quartet puzzling methods, given the close connections 
outlined in the Quartet Puzzling section. We have therefore deferred study of quartet 
distance. 



Under the path difference metric, trees are embedded in r( 2 ). Using depth-first 
search on a tree T, the embedding vector v(T) can computed in 0(n 2 ) time. Euclidean 
distance in r( 2 ) can also be computed in 0(n 2 ) time. 

Simulated data 

For simulated data, we used the first 1000 examples from the data set presented 



in 



Guindon and Gascuell (120031 ). We briefly review the details of the data set. Trees on 



40 taxa were generated according to a Markov process. For each generated tree, 40 



homologous sequences (no in dels) of 



two-parameter (K2P) model (IKimural. 



Specifically the Seq-Gen program (IRambaut and Grasslyl . 



ength 500 were generated, under the Kumura 



19801). with a tran s ition / transversion ratio of 2.0. 



19971 ) was used to generate 



the sequences. The data is available from the website 



http : / /www . atgc-montpellier . f r/phyml/datasets . php 



Reconstruction methods 



For each se t of homologous sequences D in t he simulated data, we used the 



software MrBayes ( iHuelsenbeck and Ronquist 



20011 ) to obtain 15000 samples from the 



posterior distribution P(T\D). Specifically, we ran MrBayes under the K2P model, 
discarded the initial 25% of samples as a burn-in, used a 50 generation sample rate, and 
ran for 1, 000, 000 generations in total. 

We comp uted a ML tree estimate for each data set, using the hill climbing 



software PHYML ( IGuindon and Gascuel 



20031 ) as described in the 



paper. We also 



19891 ). using pairwise 



computed a NJ tree using the software PHYLIP ( iFelsensteinl . 
distances computed by PHYLIP. 

We then used our in-house software to minimize the expected path difference 
squared euclidean distance by hill climbing. We performed hill climbing using NNI 
moves, along with various choices of starting trees. For starting trees we used the NJ 
tree, the ML tree, and five samples from P(T\D) (NJ and ML trees were computed as 



described above). We also used the MrBayes tree sample which had the highest 
likelihood, which we call the "empirical MAP" tree. 

We now briefly describe our hill climbing implementation. The input for the 
algorithm is a list of trees T 1; . . . , T/v sampled from P(T | D), and an initial starting tree 
T°. The pseudo-code is as follows: 

Algorithm 5 (Hill climbing from an initial tree T°) 

INPUT: Samples 2\, . . . , T N , and an initial tree T°. 
OUTPUT: Local minimum T* of the empirical expected loss. 
PROCEDURE: 

BEGIN 

Compute and store fi p = Yli v p(Ti)- 
Initialize T* = T°, and p* = \\v p {T°) - jl p \\ 2 . 
DO: 

Pick an NNI neighbor T new ofT*. 
Compute p n p ew = \ \v p {T new ) - fi p \\ 2 . 

TT? „new ^ n * . 
lt Pp < Pp- 

Set T* = T new and p* p = p n p ew . 
END IF 

UNTIL p* p < p r p iew is satisfied for all neighbors T new ofT*. 

Output T* 

END 

In practice, allowing the hill climbing algorithm to run until complete 
convergence might take too long. Thus, we included several alternative stopping criteria 



in the UNTIL statement. (For example, halt if a maximum number of loop iterations is 
reached.) In our simulation study, the algorithm always found a local maximum before 
halting. The source code, written in java, is available at 
http : / / cophylogeny . net/research . php . 

Simulation study: Results 
Comparing objective functions for tree reconstruction 

In our distance-based framework, the canonical measure of reconstruction 
accuracy is the distance, d p (T* ,T true ) = \\v p (T*) — v p (T true )\\, between the true tree 
rptrue an( j estimated tree T*. When reconstructing a tree, ideally we would like to 
directly use distance to the true tree as the objective function. But obviously this is 
impossible unless T true is known. One obvious question is: How good are other objective 
functions, such as likelihood and p p , as proxies for d p (-,T true )7 The relationships among 
objective functions are particularly important for nearly optimal trees. 

We explored this question using the simulated data. For each of the 1, 000 data 
sets, we computed three scores for each of the 15, 000 MrBayes samples T;, 
i = 1, . . . , 15000. The three scores we investigated are 1) The observed frequency of the 
tree topology in MrBayes samples, 2) The empirical expected loss: p p (Tj) 
= ||u p (Tj) — Y5500 J2j v p(Tj)\\ 2 , and 3) The actual distance to the true tree: d p (Ti,T true ) 
= ||V^)-^(r*™ e )|| 2 . 

For each data set, we restricted our attention to the 25 most frequent tree 
topologies. The number of samples 15, 000 was large enough so that the frequencies of 
the 25 most probable tree topologies could be estimated fairly well in most cases. For 
the 25 most probable topologies, we computed the Kendall-tau correlations between the 
three scores and recorded the results in Table [31 

If there are no ties among the 25 topologies under any of the scores, then the 
Kendall-tau has a natural interpretation: If P(s2(T) < S2(T')\si(T) < Si(T') = p for a 



randomly drawn pair T, T' of the 25 topologies, then the Kendall-tau correlation is 

2p — 1 between the scores s\, S2- As Table |3] shows, our proposed empirical expected loss 

p p outperforms likelihood, as a proxy for the distance to the true tree. 

Performance of tree reconstruction methods 

As described in (Simulation Study: Methods), for each simulated data set we 
computed NJ, ML, and empirical MAP trees. We then performed NNI-based hill 
climbing to optimize p p , using NJ/ML/MAP as starting trees as well as starts chosen 
randomly from MrBayes samples. We estimated the Bayes estimator (BE) tree by taking 
the best of five r andom starts. 

Following iGuindon and Gascuell (120031 ). we plotted the inaccuracy (path 



difference to true tree) of the NJ, ML, empirical MAP, and BE trees (Figured]). Notice 
we have reported the inaccuracy between trees T,T' as the norm ||i> p (T) — f p (T')||, 
instead of the square ||u p (T) — f p (T')|| 2 . We chose to do this so that the inaccuracy can 
be loosely interpreted as "average difference of number of edges between a typical pair of 
leaves." In the plot, inaccuracy is plotted against the maximum unadjusted pairwise 
divergence in the sequence data. The unadjusted pairwise divergence between two 
sequences is the proportion of sites where both sequences differ. 

We also give an analogous plot (Figure [2]), plotting the empirical expected loss 
p p {T) for the various tree estimators. Note the true tree might not be the global 
optimum of p p {T). Thus we included the true tree in the plot as well. 

Tables (CQ) and (j2J) summarize the results of our NNI-based hill climbing when 
ML/NJ/empirical MAP trees are used as the starting tree. Note the ML tree (computed 
by phyML) was obtained by NNI hill climbing optimizing the likelihood. Our hill 
climbing optimizes p p instead, so it is possible an NNI move can improve the phyML 
tree. 

We indeed observed that NJ, ML, and empirical MAP trees can be improved by 
hill climbing. (Table [2]) and (Table [I]) give summary information. In particular, (Table 



[T]) shows that our hill climbing algorithm improves the distance to the true tree. We find 
this particularly encouraging. 

Using a Pentium dual core system running Red Hat Linux 4, each run of our hill 
climbing programs required between 1 minute and 1.5 minutes on average per example, 
depending on the starting tree. Using the NJ tree as the initial tree took longer on 
average, because more hill climbing steps were required to find a local optimum. 

Discussion 

For phylogenetic reconstruction, the Bayes estimator is a natural choice when 
recovering the true tree is unlikely, and one is content to find a tree which is "close" to 
the true tree. Here "close" is defined by a choice of distance between trees, e.g. 
Robinson-Foulds distance. The Bayes estimator directly maximizes its expected 
accuracy, measured in terms of closeness to the true tree. In contrast, ML optimizes 
likelihood instead of a ccuracy. 



As observed in 



Holder et al. 



(120081 ). the popular consensus tree has a natural 
interpretation as the Bayes estimator which minimizes the expected Robinson-Foulds 
distance to the true tree. Thus, for the special case of Robinson-Foulds distance, Bayes 
estimators have actually been studied for quite some time. 

As part of an exploratory simulation study, we showed that hill climbing can be 
used to find an empirical Bayes estimator in practice, given a sample of trees from the 
posterior dis tribution. In par t icular we used the squared path difference metric 



described in ISteel and Penny! ( 119931 ). Hill climbing optimization produced tree estimates 
which were closer to the true tree, outperforming NJ and ML. And in the majority of 
cases, hill climbing improved distance to the true tree, even when the initial tree was 
obtained by hill climbing optimization of the likelihood. We consider this very 
encouraging for future work on hill climbing approaches for Bayes estimators. 

Systematists are best qualified to help choose which types of distances should be 
used to compare trees. On the theoretical front, some interesting new distances are 



being; studied such as th e geodesic distance (IKupczok et al. 



2008 



Owen 



2008 



2009 



Owen and Provanl . 120091 ). We believe Bayes estimators (or "median trees") under novel 
distances comprise an interesting direction for future mathematical research. We also 
think Bayes estimators under the classical quartet distance might be interesting, in light 
of the close connection to quartet puzzling. 

In this paper we used NNI moves to apply the hill climbing algorithm. One could 
also try more general tree m oves such as SPR or TBR, analogous to 



(IHordijk and Gascuel 



20051 ). It would be interesting to study which tree moves give 



faster hill climbing convergence for Bayes est imators in practice. Similarly, exploration 



strategies such as Tabu search (j Glover 
performance. 



19861 ) or simulated annealing may give better 



For some vector space embeddings (e.g. quartet embedding vqQ), the embedding 
vectors for trees may be rather high- dimensional and non-sparse. Then it may be faster 
to use the naive definition p(T) = YliLi ^) directly. Indeed, qu artet distance 



dg{T , T') can be computed in 0(n\ogn) time for two trees on n taxa (IBrodal et al 



20011 ). which is much faster than operations on the vectors vq(T),vq(T') which have 
dimension 0(n 4 ). 

In this paper, we have focused on different types of tree features that can be used 
to define distance, e.g. splits or quartets. Systematists are particularly interested in 
splits. T hus, one could also s tudy different ways to define a distance based on splits. For 



example, 



Holder et al. 



(120081 ) considered a generalized Robinson-Foulds distance that 
allows a specificity/sensitivity trade-off. We think another interesting way to modify 
Robinson-Foulds distance would be to make the distance more "local" . For example, 
one could define a transformed distance d(T,T') = mm(dRp(T, T'), K) for a given 
"ceiling" constant K > 0. Then the Bayes estimator could be interpreted as a 
"smoothed" ML tree, i.e. the ML tree after the likelihood has been smoothed by a local 



convolution. This smoothed ML tree could provide a nice compromise between ML trees 
and consensus trees. 

Finally, we note that in our simulation study, ML trees were quite accurate. In 
fact, the ML tree was typically quite close to the Bayes estimator, in terms of NNI 
moves. Thus an ML (or approximate ML) tree might be quite useful as an initial guess 
for a Bayes estimator tree. Then, one could "polish" the MLE by using hill-climbing 
optimization of the expected loss. 
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Table 1. Using MrBayes under the K2P model for 1,000,000 generations sampling 

every 50 generations. We then removed the first 5, 000 sampled trees out of 20,000 

sampled trees as 25% burn- in to obtain 15,000 samples from the posterior distribution 

P(T\D). We give the performance of hill climbing, applied to several different initial 

trees. The first two columns summarize how the local minimum compared to the initial 

tree, on the 1000 simulated data sets. The third column gives the average percentage by 

which hill climbing decreases the path difference distance to the true tree. This is 

computed as 1— mea.ia(d in i tia i / df ina i) , where mean() denotes denotes geometric mean. If 

either the initial or final distance to the true tree is zero, we add 1 to both distances. 
Initial tree Hill climbing improves Hill climbing worsens Avg drop in 

distance to T true l distance to T true ? distance to T true 

ML tree 380 253 5.9% 

Empirical MAP tree 508 185 17.9% 

NJ tree 693 229 39.6% 



Table 2. Using MrBayes under the K2P model for 1,000,000 generations sampling 

every 50 generations. We then removed the first 5, 000 sampled trees out of 20,000 

sampled trees as 25% burn- in to obtain 15,000 samples from the posterior distribution 

P(T\D). The first two columns summarize how the local minimum compared to the 

initial tree, on the 1000 simulated data sets. The third column gives the average 

percentage by which hill climbing decreases p p . This is computed as 1 — 

mean( J p^ tial / pf mal ) , where mean() denotes denotes geometric mean. If either p l ™ tial 

or p p hnal is zero, we add 1 to both. 

Initial tree Hill climbing improves Hill climbing worsens Avg drop in 



ML tree 

Empirical MAP tree 
NJ tree 



690 
870 
961 









5.9% 
8.6% 
20.3% 



Table 3. For each of the 1, 000 data sets, we computed three scores for each of the 
15, 000 MrBayes samples T», % — 1, . . . , 15000. The three scores we investigated are 1} 
The observed frequency of the tree topology in MrBayes samples, 2) The empirical 
expected distance p p (Tj) = ||w p (Tj) — J2j v p(Tj)\\ 2 , and 3) The actual distance 
d p {T u T*™) = \\v p (T t ) - Vp (T true )\\ 2 . 

P(Tj) p p (T t ) d p {T h T tr ^) 

P(Ti) ■ 0.352 0.148 

P P (Ti) ■ 0.270 
d p (T u T true ) 



Legends to Figures 




Figure 1. Normalized, unsquared path difference \ \v p (T true ) — u p (T*)|| / y (™) for tree 
estimates T* computed by various reconstruction methods, for 1000 simulated trees 
rptrue on ra = 40 taxa. Here NJ (N) is the neighbor joining tree constructed via neighbor 
in PHYLIP package, ML (L) is the PHYML tree, MAP (M) is the MrBayes sample with the 
highest posterior probability, and Bayes (B) is the Bayes Estimator tree, estimated from 
MrBayes samples. 

Figure 2. Normalized empirical expected distance to the true tree, 



Y Jr J2f=i \ \ v p{T true ) — Vp(T*)\\ 2 I J (™) for tree estimates T* computed by various 
reconstruction methods, for 1000 simulated trees T true on n = 40 taxa. Here NJ (N) is 
the neighbor joining tree constructed via neighbor in PHYLIP package, ML (L) is the 
PHYML tree, MAP (M) is the MrBayes sample with the highest posterior probability, and 
Bayes (B) is the Bayes Estimator tree, estimated from MrBayes samples. 

Figure 3. Two trees on five taxa with different topologies, T\ (left) and (right). 




Figure 4. Same two trees in Figure [3] with branch lengths assigned, T\ (left) and T2 
(right). 
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